Intrinsic vs. Extrinsic Evaluation of Czech Sentence Embeddings: Semantic Relevance Doesn't Help with MT Evaluation
Barančíková, Petra, Bojar, Ondřej
In this paper, we compare Czech-specific and multilingual sentence embedding models through intrinsic and extrinsic evaluation paradigms. For intrinsic evaluation, we employ Costra, a complex sentence transformation dataset, and several Semantic Textual Similarity (STS) benchmarks to assess the ability of the embeddings to capture linguistic phenomena such as semantic similarity, temporal aspects, and stylistic variations. For extrinsic evaluation, we fine-tune each embedding model using COMET-based metrics for machine translation evaluation. Our experiments reveal an interesting disconnect: models that excel in intrinsic semantic similarity tests do not consistently yield superior performance on downstream translation evaluation tasks. Conversely, models with seemingly over-smoothed embedding spaces can, through fine-tuning, achieve excellent results. These findings highlight the complex relationship between semantic property probes and downstream tasks, emphasizing the need for further research into 'operationalizable semantics' in sentence embeddings, as well as for more in-depth downstream task datasets (here, translation evaluation).
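The intrinsic STS protocol mentioned in the abstract is typically scored by correlating embedding cosine similarity with human judgments. A minimal stdlib sketch of that scoring loop, with toy vectors and gold scores standing in for real sentence embeddings and benchmark annotations:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def spearman(xs, ys):
    """Spearman rank correlation (no tie handling; illustration only)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Toy "embeddings" for three sentence pairs, with made-up gold STS
# scores on the usual 0-5 scale.
pairs = [
    (([1.0, 0.0, 1.0], [1.0, 0.1, 0.9]), 4.8),  # near-paraphrases
    (([1.0, 1.0, 0.0], [0.0, 1.0, 1.0]), 2.5),  # loosely related
    (([1.0, 0.0, 0.0], [0.0, 0.0, 1.0]), 0.2),  # unrelated
]
preds = [cosine(u, v) for (u, v), _ in pairs]
golds = [g for _, g in pairs]
correlation = spearman(preds, golds)
```

A high Spearman correlation here is exactly the "intrinsic semantic similarity" signal that, per the paper's finding, need not carry over to fine-tuned downstream performance.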
Attention-based transformer models for image captioning across languages: An in-depth survey and evaluation
Albadarneh, Israa A., Hammo, Bassam H., Al-Kadi, Omar S.
Image captioning involves generating textual descriptions from input images, bridging the gap between computer vision and natural language processing. Recent advancements in transformer-based models have significantly improved caption generation by leveraging attention mechanisms for better scene understanding. While various surveys have explored deep learning-based approaches for image captioning, few have comprehensively analyzed attention-based transformer models across multiple languages. This survey reviews attention-based image captioning models, categorizing them into transformer-based, deep learning-based, and hybrid approaches. It explores benchmark datasets, discusses evaluation metrics such as BLEU, METEOR, CIDEr, and ROUGE, and highlights challenges in multilingual captioning. Additionally, this paper identifies key limitations in current models, including semantic inconsistencies, data scarcity in non-English languages, and weak reasoning ability. Finally, we outline future research directions, such as multimodal learning and real-time applications in AI-powered assistants, healthcare, and forensic analysis. This survey serves as a comprehensive reference for researchers aiming to advance the field of attention-based image captioning.
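Of the metrics the survey discusses, BLEU is the most widely used; at its core it is a brevity-penalized geometric mean of modified n-gram precisions. A simplified single-reference, sentence-level sketch (real corpus-level BLEU adds smoothing and multiple references):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Simplified BLEU: geometric mean of modified n-gram precisions
    times a brevity penalty, against a single reference caption."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # "modified" precision: clip each n-gram count by the reference count
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # penalize candidates shorter than the reference
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * geo
```

METEOR, CIDEr, and ROUGE refine this overlap idea with stemming/synonymy, TF-IDF weighting, and recall orientation, respectively.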
Labeling Sentences with Symbolic and Deictic Gestures via Semantic Similarity
Gjaci, Ariel, Recchiuto, Carmine Tommaso, Sgorbissa, Antonio
Co-speech gesture generation on artificial agents has gained attention recently, mainly when it is based on data-driven models. However, end-to-end methods often fail to generate co-speech gestures related to semantics with specific forms, i.e., Symbolic and Deictic gestures. In this work, we identify which words in a sentence are contextually related to Symbolic and Deictic gestures. First, we selected 12 gestures recognized within Italian culture that different humanoid robots can reproduce. Then, we implemented two rule-based algorithms to label sentences with Symbolic and Deictic gestures. The rules depend on semantic similarity scores, computed with the RoBERTa model, between sentences that heuristically represent gestures and sub-sentences inside a target sentence that artificial agents have to pronounce. We also implemented a baseline algorithm that assigns gestures without computing similarity scores. Finally, to validate the results, we asked 30 participants to label a set of sentences with Deictic and Symbolic gestures through a Graphical User Interface (GUI), and we compared the labels with the ones produced by our algorithms. To this end, we computed Average Precision (AP) and Intersection Over Union (IOU) scores, and we evaluated the Average Computational Time (ACT). Our results show that semantic similarity scores are useful for finding Symbolic and Deictic gestures in utterances.
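The labeling rule the abstract describes, matching sub-sentences against sentences that heuristically represent gestures, can be sketched as follows. Bag-of-words cosine stands in for the RoBERTa embeddings, and the gesture lexicon, window size, and threshold below are invented for illustration (the paper uses 12 gestures from Italian culture):

```python
import math
from collections import Counter

def bow_cosine(a, b):
    """Cosine similarity between bag-of-words vectors of two texts.
    Toy stand-in for RoBERTa sentence-embedding similarity."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical gesture lexicon: each gesture is represented by a sentence.
GESTURES = {
    "pointing": "look over there at that place",
    "greeting": "hello nice to meet you",
    "negation": "no I do not agree with that",
}

def label_sentence(sentence, threshold=0.3, window=5):
    """Slide a window over the sentence; assign the best-matching gesture
    to each sub-sentence whose similarity exceeds the threshold."""
    tokens = sentence.split()
    labels = []
    for i in range(max(1, len(tokens) - window + 1)):
        sub = " ".join(tokens[i:i + window])
        best = max(GESTURES, key=lambda g: bow_cosine(sub, GESTURES[g]))
        score = bow_cosine(sub, GESTURES[best])
        if score >= threshold:
            labels.append((sub, best, round(score, 2)))
    return labels
```

The paper's baseline corresponds to assigning gestures without the similarity computation, which this sketch makes easy to ablate by ignoring `score`.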
Proofread: Fixes All Errors with One Tap
Liu, Renjie, Zhang, Yanxiang, Zhu, Yun, Sun, Haicheng, Zhang, Yuanbo, Huang, Michael Xuelin, Cai, Shanqing, Meng, Lei, Zhai, Shumin
The impressive capabilities of Large Language Models (LLMs) provide a powerful approach to reimagining users' typing experience. This paper demonstrates Proofread, a novel Gboard feature powered by a server-side LLM that enables seamless sentence-level and paragraph-level corrections with a single tap. We describe the complete system, from data generation and metrics design to model tuning and deployment. To obtain models of sufficient quality, we implement a careful synthetic data pipeline tailored to online use cases, design multifaceted metrics, and employ a two-stage tuning approach to obtain a dedicated LLM for the feature: Supervised Fine-Tuning (SFT) for foundational quality, followed by Reinforcement Learning (RL) tuning for targeted refinement. Specifically, we find that sequential tuning on the Rewrite and Proofread tasks yields the best quality in the SFT stage, and we propose global and direct rewards in the RL tuning stage to seek further improvement. Extensive experiments on a human-labeled golden set showed that our tuned PaLM2-XS model achieved an 85.56% good ratio. We launched the feature on Pixel 8 devices by serving the model on TPU v5 in Google Cloud, with thousands of daily active users. Serving latency was significantly reduced by quantization, bucket inference, text segmentation, and speculative decoding. A demo is available on YouTube: https://youtu.be/4ZdcuiwFU7I
What Have We Achieved on Non-autoregressive Translation?
Li, Yafu, Zhang, Huajian, Yan, Jianhao, Yin, Yongjing, Zhang, Yue
Recent advances have made non-autoregressive translation (NAT) comparable to autoregressive translation (AT). However, NAT evaluation using BLEU has been shown to correlate only weakly with human annotations. Little research compares non-autoregressive and autoregressive translation comprehensively, leaving uncertainty about how close NAT truly is to AT. To address this gap, we systematically evaluate four representative NAT methods across various dimensions, including human evaluation. Our empirical results demonstrate that, despite a narrowing performance gap, state-of-the-art NAT still underperforms AT under more reliable evaluation metrics. Furthermore, we find that explicitly modeling dependencies is crucial for generating natural language and for generalizing to out-of-distribution sequences.
Does Dependency Locality Predict Non-canonical Word Order in Hindi?
Ranjan, Sidharth, van Schijndel, Marten
Previous work has shown that isolated non-canonical sentences with Object-before-Subject (OSV) order are initially harder to process than their canonical counterparts with Subject-before-Object (SOV) order. Although this difficulty diminishes with appropriate discourse context, the underlying cognitive factors responsible for alleviating processing challenges in OSV sentences remain an open question. In this work, we test the hypothesis that dependency length minimization is a significant predictor of non-canonical (OSV) syntactic choices, especially when controlling for information status such as givenness and surprisal measures. We extract sentences from the Hindi-Urdu Treebank corpus (HUTB) that contain clearly defined subjects and objects, systematically permute the preverbal constituents of those sentences, and deploy a classifier to distinguish between original corpus sentences and artificially generated alternatives. The classifier leverages various discourse-based and cognitive features, including dependency length, surprisal, and information status, to inform its predictions. Our results suggest that, although there exists a preference for minimizing dependency length in non-canonical corpus sentences amidst the generated variants, this factor does not significantly contribute to identifying corpus sentences above and beyond surprisal and givenness measures. Notably, discourse predictability emerges as the primary determinant of constituent-order preferences. These findings are further supported by human evaluations involving 44 native Hindi speakers. Overall, this work sheds light on the role of expectation adaptation in word-ordering decisions. We conclude by situating our results within the theories of discourse production and information locality.
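The dataset construction the abstract describes, permuting preverbal constituents to produce artificial alternatives to each corpus sentence, can be sketched as below. The Hindi clause is a toy illustration, not an HUTB item:

```python
from itertools import permutations

def make_variants(preverbal, verb):
    """Given a clause split into preverbal constituents plus a
    clause-final verb, pair every reordering of the preverbal material
    with a label saying whether it is the original corpus order.
    These (variant, label) pairs are classifier training instances."""
    original = tuple(preverbal) + (verb,)
    variants = []
    for perm in permutations(preverbal):
        cand = perm + (verb,)
        variants.append((cand, cand == original))
    return variants

# Toy Hindi clause: "raam-ne kitaab padhii" ('Ram read the book'),
# with two preverbal constituents and a clause-final verb.
variants = make_variants(["raam-ne", "kitaab"], "padhii")
```

Each variant would then be featurized with dependency length, surprisal, and information-status features before classification.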
Work Smarter...Not Harder: Efficient Minimization of Dependency Length in SOV Languages
Ranjan, Sidharth, von der Malsburg, Titus
Dependency length minimization is a universally observed quantitative property of natural languages. However, the extent of dependency length minimization and the cognitive mechanisms through which the language processor achieves it remain unclear. This research offers mechanistic insights by postulating that moving a short preverbal constituent next to the main verb explains preverbal constituent ordering decisions better than global minimization of dependency length in SOV languages. This approach constitutes a least-effort strategy because it requires only one operation yet simultaneously reduces the length of all preverbal dependencies linked to the main verb. We corroborate this strategy using large-scale corpus evidence across all seven SOV languages that are prominently represented in the Universal Dependencies treebanks. These findings align with the concept of bounded rationality, where decision-making is driven by 'quick-yet-economical' heuristics rather than by exhaustive searches for optimal solutions. Overall, this work sheds light on the role of bounded rationality in linguistic decision-making and language evolution.
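The contrast between the least-effort heuristic and global minimization can be illustrated with a toy model in which each preverbal constituent attaches to a clause-final verb through its last word (a simplifying assumption for illustration, not the paper's exact formulation):

```python
from itertools import permutations

def deplen(order):
    """Total dependency length when the constituents in `order` (pairs of
    (name, length_in_words)) precede a clause-final verb and each
    constituent attaches to the verb through its last word."""
    verb_pos = sum(length for _, length in order)
    total, pos = 0, 0
    for _, length in order:
        total += verb_pos - (pos + length - 1)  # distance: last word -> verb
        pos += length
    return total

def global_minimum(constituents):
    """Exhaustive search over all orders: what a globally minimizing
    producer would have to do."""
    return min(permutations(constituents), key=deplen)

def least_effort(constituents):
    """The one-move heuristic: place only the shortest constituent next
    to the verb, leaving the others in their given order."""
    shortest = min(constituents, key=lambda c: c[1])
    rest = [c for c in constituents if c is not shortest]
    return tuple(rest) + (shortest,)
```

In small cases like the test below the single move already reaches the global optimum, which is the paper's point: a cheap heuristic captures most of the minimization without an exhaustive search.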
Metaphor Understanding Challenge Dataset for LLMs
Tong, Xiaoyu, Choenni, Rochelle, Lewis, Martha, Shutova, Ekaterina
Metaphors in natural language are a reflection of fundamental cognitive processes such as analogical reasoning and categorisation, and are deeply rooted in everyday communication. Metaphor understanding is therefore an essential task for large language models (LLMs). We release the Metaphor Understanding Challenge Dataset (MUNCH), designed to evaluate the metaphor understanding capabilities of LLMs. The dataset provides over 10k paraphrases for sentences containing metaphor use, as well as 1.5k instances containing inapt paraphrases. The inapt paraphrases were carefully selected to serve as a control for determining whether the model indeed performs full metaphor interpretation or rather resorts to lexical similarity. All apt and inapt paraphrases were manually annotated. The metaphorical sentences cover natural metaphor uses across four genres (academic, news, fiction, and conversation), and they exhibit different levels of novelty. Experiments with LLaMA and GPT-3.5 demonstrate that MUNCH presents a challenging task for LLMs. The dataset is freely accessible at https://github.com/xiaoyuisrain/metaphor-understanding-challenge.
Neural paraphrasing by automatically crawled and aligned sentence pairs
Globo, Achille, Trevisi, Antonio, Zugarini, Andrea, Rigutini, Leonardo, Maggini, Marco, Melacci, Stefano
Paraphrasing is the task of re-writing an input text using other words, without altering the meaning of the original content. Conversational systems can exploit automatic paraphrasing to make the conversation more natural, e.g., talking about a certain topic using different paraphrases at different moments. Recently, the task of automatically generating paraphrases has been approached in the context of Natural Language Generation (NLG). While many existing systems simply consist of rule-based models, the recent success of Deep Neural Networks in several NLG tasks naturally suggests exploiting such networks for generating paraphrases. However, the main obstacle to neural-network-based paraphrasing is the lack of large datasets with aligned pairs of sentences and paraphrases, which are needed to efficiently train the neural models. In this paper, we present a method for the automatic generation of large aligned corpora, based on the assumption that news and blog websites cover the same events using different narrative styles. We propose a similarity search procedure with linguistic constraints that, given a reference sentence, is able to locate the most similar candidate paraphrases from among millions of indexed sentences. The data generation process is evaluated for the Italian language, performing experiments using pointer-based deep neural architectures.
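The constrained similarity search can be sketched with lexical Jaccard overlap standing in for the paper's similarity measure; the thresholds and the length constraint below are illustrative assumptions, and a real index over millions of sentences would use an inverted index or ANN search rather than a linear scan:

```python
def jaccard(a, b):
    """Word-set overlap between two sentences (toy similarity measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def find_paraphrase_candidates(reference, index, lo=0.4, hi=0.9):
    """Return indexed sentences similar enough to be paraphrases of
    `reference` but not near-duplicates, subject to a simple
    length-comparability constraint."""
    out = []
    for sent in index:
        s = jaccard(reference, sent)
        # `hi` filters near-copies; the length check is an illustrative
        # stand-in for the paper's linguistic constraints
        if lo <= s <= hi and abs(len(sent.split()) - len(reference.split())) <= 3:
            out.append((s, sent))
    return sorted(out, reverse=True)
```

The upper threshold matters: without it, the highest-scoring "paraphrases" of a news sentence are verbatim copies from syndicated articles, which teach a generation model nothing.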
An Examination of the Compositionality of Large Generative Vision-Language Models
Ma, Teli, Li, Rong, Liang, Junwei
With the success of Large Language Models (LLMs), a surge of Generative Vision-Language Models (GVLMs) has been constructed via multimodal instruction tuning. The tuning recipe substantially deviates from the common contrastive vision-language learning. However, the performance of GVLMs in multimodal compositional reasoning remains largely unexplored, as existing evaluation metrics and benchmarks focus predominantly on assessing contrastive models like CLIP. In this paper, we examine potential evaluation metrics for assessing GVLMs and hypothesize that generative scoring methods are suitable for evaluating compositionality. In addition, current benchmarks tend to prioritize syntactic correctness over semantics. The presence of morphological bias in these benchmarks can be exploited by GVLMs, leading to ineffective evaluations. To combat this, we define a MorphoBias Score to quantify the morphological bias and propose a novel LLM-based strategy to calibrate the bias. Moreover, we add a challenging task to evaluate the robustness of GVLMs against their inherent inclination toward syntactic correctness. We include the calibrated dataset and the task in a new benchmark, namely the MOrphologicall De-biased Benchmark (MODE). Our study provides the first unbiased benchmark for the compositionality of GVLMs, facilitating future research in this direction. We will release our code and datasets.